Audio coding

ABSTRACT

An audio encoding scheme for a stream that encodes audio and video data is disclosed. The scheme has particular application in mezzanine-level coding in digital television broadcasting. The scheme has a mean effective audio frame length {overscore (F)} that equals the video frame length 1/f_(V) over an integral number M of video frames, by provision of audio frames of variable length F in a defined sequence F(j) at encoding. The length of the audio frames may be varied by altering the length of overlap between adjacent frames in accordance with an algorithm that repeats after a sequence of M video frames. An encoder and a decoder for such a scheme are also disclosed.

BACKGROUND TO THE INVENTION

1. Field of the Invention

This invention relates to coding of audio signals into a data stream such that it can be edited at points synchronised to another data stream. It has particular, but not exclusive, application to a digital television transmission scheme requiring non-destructive splicing of the audio in the compressed domain at the associated video frame boundaries.

Digital Television (DTV) systems allow several programmes to be broadcast over a channel of limited bandwidth. Each of these programmes has video and audio content. Some of these programmes may contain high quality multichannel audio (e.g., 5 channels that can be reproduced by home cinema systems). DTV production sites, networks and affiliates typically use video tape recorders and transmission lines for carrying all audio content. Much of this infrastructure has capacity for only two uncompressed audio channels, so multiple channels are normally lightly compressed and formatted before recording or transmission. Prior to emission (i.e., broadcasting to the end-user) the programme streams are strongly compressed.

In contribution and distribution stages of DTV production, original streams must be spliced for programme editing or programme switching (e.g., for insertion of local content into a live network feed). Such splicing is performed at video frame boundaries within the content stream.

The audio content of the broadcast stream must meet several requirements. DTV viewers may expect received programmes to have a high perceived audio quality, particularly when the programmes are to be reproduced using high quality reproduction equipment such as in a home cinema system. For example, there should be no audible artefacts due to cascading of multiple encoding and decoding stages, and there should be no perceptible interruption in sound during programme switching. Most importantly, the reproduced programmes must be in lip sync; that is to say, the audio stream must be synchronous with the corresponding video stream. To achieve these ends at a reasonable cost, i.e., using the existing (2-channel) infrastructure, one must splice the audio programme in the compressed domain.

2. Summary of the Prior Art

Existing mezzanine encoding schemes include Dolby E (r.t.m.), defined in Dolby Digital Broadcast Implementation Guidelines Part No. 91549, Version 2 1998 of Dolby Laboratories, for distribution of up to 8 channels of encoded audio and multiplexed metadata through an AES-3 pair. The soon to be introduced (NAB 1999) DP571 Dolby E Encoder and DP572 Dolby E Decoder should allow editing and switching of encoded audio with a minimum of mutes or glitches. Moreover, they allow cascading without audible degradation. Dolby E uses 20-bit sample size and provides a reduction between 2:1 and 5:1 in bitrate.

The British Broadcasting Corporation and others are proposing, through the ACTS ATLANTIC project, a flexible method for switching and editing of MPEG-2 video bitstreams. This seamless concatenation approach uses decoding and re-encoding with side information to avoid cascading degradation. However, this scheme is limited to application with MPEG-2 Layer II and the AES/EBU interface. Moreover, the audio data is allowed to slide with respect to edit points, introducing a time offset. Successive edits can result, therefore, in a large time offset between the audio and video information.

Throughout the broadcasting chain, video and audio streams must be maintained in lip sync. That is to say, the audio must be kept synchronous to the corresponding video. Prior to emission, distribution sites may splice (e.g., switch, edit or mix) audio and video streams (e.g., for inclusion of local content). After splicing, if video and audio frame boundaries do not coincide, which is the case for most audio coding schemes, it is not possible to automatically guarantee lip sync due to slip of the audio with respect to the video. In extreme cases, when no special measures are taken, this could lead to audio artefacts, such as mutes or glitches. Glitches may be the result of an attempt to decode a non-compliant audio stream, while mutes may be applied to avoid these glitches. An aim of this invention is to provide an encoding scheme for an audio stream that can be spliced without introducing audio artefacts such as mutes, glitches or slips.

Another aim of this invention is to provide an encoding scheme that can be subject to cascading compression and decompression with a minimal loss of quality.

SUMMARY OF THE INVENTION

From a first aspect, the invention provides an audio encoding scheme for a stream that encodes audio and video data, which scheme has a mean effective audio frame length {overscore (F)} that equals the video frame length 1/f_(V) over an integral number M video frames, by provision of audio frames variable in length F in a defined sequence F(j) at encoding.

This scheme ensures that the stream can be edited at least at each video frame without degradation to the audio information. Preferably, the frame length F may be adjusted by varying an overlap O between successive audio frames.

In schemes embodying the invention, the value F(j) may repeat periodically on j, the periodicity of F(j) defining a sequence of frames. There are typically M video and N audio frames per sequence, each audio frame being composed of k blocks. The total overlap O_(T) between frames in the sequence may be, for example, equal to O_(T)=p×O+q×(O+1), where O is an overlap length in blocks.

In one scheme within the scope of the invention, only audio frames corresponding to a particular video frame are overlapped. In such a scheme, the values of p and q may meet the following equalities: p=(N−M)×(O+1)−O_(T) and q=(N−M)−p.

In an alternative scheme, only audio frames corresponding to a particular video sequence are overlapped. In such a scheme, the values of p and q may meet the following equalities: p=(N−1)×(O+1)−O_(T) and q=(N−1)−p.

In a further alternative scheme, any adjacent audio frames are overlapped. In such a preferred scheme, the values of p and q may meet the following equalities: p=N×(O+1)−O_(T) and q=N−p. This latter scheme may provide optimal values of overlap for a sequence of M video frames such that $\exists n \in \aleph^{+} : n \times t = M \times \left( \frac{f_{A}}{f_{V}} \right)$.

We define a video sequence as an integer (and possibly finite) number of video frames (i.e., M) at a rate of f_(V) video frames per second, each video frame containing an equal integer number N of (compressed) audio frames, each audio frame containing an integer number k of blocks, each block representing an integer number t of audio samples at a sampling rate of f_(A) samples per second. By making the remainder of the division between the number of video frames times the quotient between audio and video frequencies, and the number of audio samples per block of (compressed) audio equal to zero, M is guaranteed to be an integer. Thus, N is also an integer. Consequently, the total number of overlapping blocks is also an integer and so is each single overlap. That the number of overlapping blocks is an integer is, in most cases, a requirement. Blocks of samples are the smallest units of information handled by the underlying codec.
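By way of illustration only, the zero-remainder condition described above may be used to find the smallest sequence length M for given frequencies. The following C sketch is not part of the disclosed scheme: the function names are hypothetical, and the video frame rate is assumed to be supplied as an exact fraction fv_num/fv_den (for example 24000/1001 for the rate written as 23.976 Hz) so that the arithmetic stays in integers.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor by Euclid's algorithm. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/*
 * Smallest M for which M * (f_A / f_V) is a whole number of blocks of
 * t samples. f_V is passed as the fraction fv_num / fv_den (an
 * assumption of this sketch) so that NTSC-type rates remain exact.
 */
uint64_t smallest_sequence_length(uint64_t f_a, uint64_t fv_num,
                                  uint64_t fv_den, uint64_t t)
{
    assert(f_a > 0 && fv_num > 0 && fv_den > 0 && t > 0);

    uint64_t num = f_a * fv_den; /* f_A / f_V       = num / fv_num */
    uint64_t den = fv_num * t;   /* f_A / (f_V * t) = num / den    */

    /* M * num must be divisible by den, hence M = den / gcd(num, den). */
    return den / gcd_u64(num, den);
}
```

For f_(A)=48 kHz, f_(V)=24 Hz and t=32, a call such as smallest_sequence_length(48000, 24, 1, 32) yields M=2, since two video frames span 4,000 samples, i.e. 125 blocks.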

From a second aspect, the invention provides an audio encoding scheme for a stream that carries encoded audio and video data in which scheme audio samples of N quasi video-matched frames are encoded in frames with a semi-variable overlap whereby the effective length of the audio frames coincides with the length of a sequence of M video frames, where M and N are positive integers.

The invention provides a data stream encoded by a scheme according to either preceding aspect of the invention. Such a stream may include audio frames, each of which is tagged to indicate the size of the audio frame. The blocks may be similarly tagged to indicate whether the block is a redundant block.

From another aspect this invention provides an audio encoder (that may be implemented for example as a software component or a hardware circuit) for encoding an audio stream according to the first aspect of the invention; and it further provides an audio decoder for decoding an audio stream according to the first aspect of the invention.

An audio decoder according to this aspect of the invention operates by modifying the redundancy status of blocks in the data stream by application of one or more of a set of block operators to each block. This may be accomplished by a set of operators that includes one or more of: NOP, an operator that does not change the status of a block; DROP, an operator that changes the first non-redundant block from the head overlap into a redundant block; APPEND, an operator that changes the first redundant block from the tail overlap into a non-redundant block; and SHIFT, an operator that is a combination of both DROP and APPEND operators.

In particular, the invention provides an audio encoder for coding audio for a stream that encodes audio and video data in which the encoder produces audio frames of variable length such that a mean effective audio frame length {overscore (F)} equals the video frame length 1/f_(V) over an integral number M video frames, by provision of audio frames with variable overlap so as to have length F in a defined sequence F(j) at encoding.

Such an audio encoder may code a stream to have a short overlap of length O and a total of q long overlaps in a sequence, the encoder calculating the head overlap using an algorithm that repeats after N audio frames.

From a further aspect, the invention provides an audio decoder (that may be implemented for example as a software component or a hardware circuit) for decoding a stream that carries encoded audio and video data, which decoder calculates an expected frame length of an incoming frame F in a, possibly circularly shifted, sequence F(j), adjusts the actual length of the incoming frame to make it equal to the expected frame length, determines whether any block within a received frame is a redundant block or a non-redundant block, and maps the non-redundant blocks onto sub-band audio samples.

In systems embodying the invention, there is typically no extra manipulation of the audio, such as sample rate conversion. Moreover, all information needed to correctly decode the received stream is most typically added at the encoder and there is no need to modify this information during editing. Therefore, editing may be done using the existing infrastructure with no modifications. Furthermore, very little extra information need be added to the stream in order to make decoding possible. Last, but not least, when using MPEG as the emission format, it may be convenient to also use an MPEG-like format for transmission.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

An embodiment of the invention will now be described in detail, by way of example only, and with reference to the accompanying drawings, in which:

FIG. 1 is a diagram of a typical chain involved in DTV broadcasting;

FIG. 2 is a diagram showing the principal components of a typical DTV production site;

FIG. 3 is a diagram showing the principal components of a typical DTV network site;

FIG. 4 is a diagram that shows the arrangement of audio and video frames within a stream encoded in accordance with a first approach in an embodiment of the invention;

FIG. 5 is a diagram that shows the arrangement of audio and video frames within a stream encoded in accordance with a second approach in an embodiment of the invention;

FIG. 6 is a diagram that shows the arrangement of audio and video frames within a stream encoded in accordance with a third approach in an embodiment of the invention;

FIG. 7 shows the bit allocation of a stream embodying the invention, based on MPEG-2 Layer II, for NTSC and 48 kHz audio in IEC61937; and

FIG. 8 is a diagram of the arrangement of blocks in a stream encoded by an embodiment of the invention.

In the following description, the following symbols are used throughout:

-   f_(A), f_(V): audio sampling frequency, video frame rate
-   t_(A), t_(V): audio, video frame duration
-   s: samples per audio frame
-   k: blocks of samples per audio frame
-   t: samples per block
-   O, O_(T), {overscore (O)}: short, total and average overlap
-   M, N: quantity of video, audio frames per sequence
-   p: quantity of short overlaps per sequence
-   q: quantity of long overlaps per sequence
-   j: frame index
-   F(j), G(j): frame's effective length
-   H(j), T(j): frame's head, tail overlap
-   X(j), {overscore (X)}(j): accumulated effective length, accumulated mean effective length
-   {overscore (F)}: mean effective length
-   b: short frame's length
-   B: total number of blocks in video sequence
-   φ: phase
-   ℵ⁺: {1, 2, 3, . . . , ∞}
-   Q: null padding
-   A(j): append operation toggle
-   OP(j): operator
-   ε(j): synchronisation error
-   δ: total synchronisation error
-   u, v: auxiliary variables

With reference first to FIG. 1, a typical DTV broadcasting system is a chain involving a contribution stage 10, a distribution stage 12 and an emission stage 14.

In the contribution stage, content is originated at one or more production sites 20, and transferred by a distribution network 22 to a broadcast network site 24. The broadcast network site 24 produces a programme stream that includes the content, and distributes the programme stream over a distribution network 30 to affiliates, such as a direct-to-home satellite broadcaster 32, a terrestrial broadcaster 34, or a cable television provider 36. A subscriber 40 can then receive the programme stream from the output of one of the affiliates.

Within the production site 20, as shown in FIG. 2, content of several types may be produced and stored on different media. For example, a first studio 50 may produce live content and a second studio 52 may produce recorded content (e.g. commercial advertisements). In each case, the content includes a video and an audio component. Output from each studio 50, 52 is similarly processed by a respective encoder 54 to generate an elementary stream that encodes the audio and video content. The content from the first studio 50, to be broadcast live, is then transmitted to the distribution network 22 by a radio link (after suitable processing). Time is not critical for the content of the second studio 52, so this may be recorded on tape 56 and sent to the distribution network 22 in an appropriate manner. The encoder 54, and the elementary stream that it produces, are embodiments of aspects of the invention.

Within the network site 24, as shown in FIG. 3, content from various sources is spliced to construct a programme output by a splicer 60. Input to the splicer 60 consists of elementary streams of similar types that can be derived from various sources, such as a radio link from the production site 20, a tape 56 or a local studio 64. Output of the splicer 60 is likewise an elementary stream that, at any given time, is a selected one of the input streams. The splicer 60 can be operated to switch between the input streams in a manner that ensures that the audio and video components of the output stream can be seamlessly reproduced. Output of the splicer 60 is then processed by a packetiser 62 to form a transport stream. The transport stream is then modulated for transmission by a radio link to the affiliates for distribution to subscribers.

The video content encoded within an elementary stream embodying the invention will typically comprise a sequence of scanned video frames. Such frames may be progressive scanning video frames, in which case each frame is a complete still picture. In such cases, the video frames have a frame rate f_(V) and are each of duration t_(V)=1/f_(V). Alternatively, the frames may be interlaced scanning frames in which each frame is built up from two successive interlaced fields, the field frequency being 2f_(V) in the notation introduced above. The frame rate and scanning type are defined by the television system for which the stream is intended. Basic TV standards PAL and NTSC derived the frame rates from the mains frequency of the countries where the standards were used. With the introduction of colour, NTSC was modified by a factor 1000/1001. Additionally, film uses 24 Hz, which may be modified by the same factor. Moreover, computer monitors can run at several frame rates up to 96 Hz. Typical values of f_(V) are given in Table 1, below.

TABLE 1

| Video frame rate [Hz] | t_(V) [ms] | Application |
|---|---|---|
| 23.976 | 41.71 | 3-2 pull-down NTSC |
| 24 | 41.67 | film |
| 25 | 40 | PAL, SECAM |
| 29.97 | 33.37 | NTSC, PAL-M, SECAM-M |
| 30 | 33.33 | drop-frame NTSC |
| 50 | 20 | double-rate PAL |
| 59.94 | 16.68 | double-rate NTSC |
| 60 | 16.67 | double-rate, drop-frame NTSC |

The audio signal is a time-continuous pulse-code modulated (PCM) signal sampled at a frequency f_(A), for example, 48 kHz. Example values of f_(A) are given in Table 2, below.

TABLE 2

| Audio sampling frequency [kHz] | Application |
|---|---|
| 24 | DAB |
| 32 | DAT, DBS |
| 44.1 | CD, DA-88, DAT |
| 48 | professional audio, DA-88, DVD |
| 96 | DVD |

Besides these frequencies, it is also possible to find 44.1 and 48 kHz modified by a factor 1000/1001 (e.g., 44.056, 44.144, 47.952 and 48.048 kHz) for conforming audio in pull-up and pull-down film-to-NTSC conversions. Additionally, for film-to-PAL conversions, a 24/25 factor may be applied (e.g., 42.336, 45.937, 46.08 and 50 kHz). Moreover, DAB may use 24 and 48 kHz; DVD-Audio may use 44.1, 88.2, 176.4, 48, 96 and 192 kHz; DVD-Video may use 48 and 96 kHz. DAT is specified for 32, 44.1 and 48 kHz; special versions may use also 96 kHz. Finally, compressed audio at very low bit rates may require lower sampling frequencies (e.g., 16, 22.05 and 24 kHz).

The sample width is typically 16, 20 or 24 bits.

Before compression, the audio stream is divided into audio frames of duration t_(A)=s/f_(A), where s is the number of samples per audio frame (e.g., in MPEG-2 Layer II s=1,152 samples; in AC-3 s=1,536 samples). Examples of frame lengths used in various coding schemes are shown in Table 3, below.

TABLE 3

| Coding scheme | Use | Frame length [samples] | t_(A) [ms] @ 48 kHz |
|---|---|---|---|
| MPEG-1 Layer I | DCC | 384 | 8 |
| MPEG-1 Layer II | DAB, DVB, DVD-V | 1,152 | 24 |
| MPEG-1 Layer III | ISDN, MP3 | 1,152 | 24 |
| MPEG-2 Layer II | DVB, DVD | 1,152 | 24 |
| MPEG-2 AAC | | 1,024 | 21.33 |
| Dolby AC-3 | DVD | 1,536 | 32 |
| Sony ATRAC | MiniDisc | 512 | n.a. |

Inside the audio encoder, audio frames are further divided into k blocks of t samples (e.g., in MPEG-2 Layer II there are 36 blocks of 32 samples). The blocks are the smallest unit of audio to be processed. This may be expressed as s=k×t. Table 4 below presents examples of frame sub-divisions used in various coding schemes.

TABLE 4

| Coding scheme | k × t [blocks × samples] |
|---|---|
| MPEG Layer I | 12 × 32 |
| MPEG Layer II | 36 × 32 |
| MPEG Layer III | 2 × 576 |
| Dolby AC-3 | 6 × 256 |

Throughout the broadcasting chain, video and audio streams must be maintained in lip sync. That is to say, the audio must be kept synchronous to the corresponding video. Prior to emission, distribution sites may splice (e.g., switch, edit or mix) audio and video streams (e.g., for inclusion of local content).

After splicing, if video and audio frame boundaries do not coincide, which is the case for most audio coding schemes, it is not possible to automatically guarantee lip sync. In extreme cases, when no special measures are taken, this could lead to audio artefacts, such as mutes or slips.

Although the various embodiments of the invention can perform an encoding related to existing standards (such as MPEG-1 and MPEG-2) the embodiments are not necessarily backward compatible with these existing standards.

Basis of the Embodiments

In the coding scheme of the present embodiment, the audio samples are encoded in N quasi video-matched frames, with a semi-variable overlap, so as to coincide with a sequence of M video frames. Upon encoding in accordance with an embodiment of the invention, each video frame contains an equal integer number of audio frames. Therefore, editing may be done at video frame boundaries. Upon decoding, redundant samples may be discarded.

Assuming an audio frame is divided into k blocks of t samples, the total overlap O_(T), in blocks, may be calculated by:

$$O_{T} = \left( k \times N \right) - \left( \frac{M}{t} \times \frac{f_{A}}{f_{V}} \right) \qquad \text{(Equation 1)}$$

where M, N, k and t are positive integers and f_(A) and f_(V), representing frequencies in Hz, are such that f_(A)/f_(V) is a rational number.
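As a hedged, illustrative sketch of Equation 1 (not a definitive implementation), the total overlap may be evaluated in integer arithmetic as follows. The function name is hypothetical, and f_(V) is assumed to be given as the fraction fv_num/fv_den so that rates such as 24000/1001 Hz are represented exactly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total overlap O_T in blocks (Equation 1):
 *   O_T = k * N - (M / t) * (f_A / f_V)
 * with f_V = fv_num / fv_den (an assumption of this sketch).
 */
int64_t total_overlap_blocks(int64_t k, int64_t n, int64_t m, int64_t t,
                             int64_t f_a, int64_t fv_num, int64_t fv_den)
{
    assert(k > 0 && n > 0 && m > 0 && t > 0);
    assert(f_a > 0 && fv_num > 0 && fv_den > 0);

    /* Blocks spanned by M video frames = blocks_num / blocks_den. */
    int64_t blocks_num = m * f_a * fv_den;
    int64_t blocks_den = fv_num * t;

    /* The sequence is required to span a whole number of blocks. */
    assert(blocks_num % blocks_den == 0);

    return k * n - blocks_num / blocks_den;
}
```

For example, with k=36, N=4, M=2, t=32, f_(A)=48 kHz and f_(V)=24 Hz the sequence spans 125 blocks, so the function returns 144−125=19 blocks of total overlap.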

For providing cross-fade between edited audio streams within the decoder reconstruction filters, the total overlap O_(T) is chosen to coincide with an integer number of blocks, as given by:

$$O_{T} = p \times O + q \times (O + 1) \qquad \text{(Equation 2)}$$

where p, q and O are non-negative integers.

Within various embodiments of the invention various approaches can be adopted for spreading the total overlap through the audio frames. That is, by imposing different restrictions one may give different implementations for these embodiments. Three such approaches are referred to herein as:

-   Approach 1: overlaps within video frame;
-   Approach 2: overlaps within sequence of video frames; and
-   Approach 3: overlap throughout the video stream.

It can be shown that Approach 3 always offers the smallest possible overlap between two adjacent audio frames, often with the smallest number of video frames per sequence. Therefore, for many applications, this approach will be preferred to the others. However, depending upon the particular application, this may not always be the case.

Approach 1

When the overlaps exist only within one video frame, as in FIG. 4, the average overlap {overscore (O)}, in blocks, is given by:

$$\overline{O} = \frac{O_{T}}{N - M} \qquad \text{(Equation 3)}$$

which may be implemented as

$$p = (N - M) \times (O + 1) - O_{T} \qquad \text{(Equation 4)}$$

overlaps of length O blocks and

$$q = (N - M) - p \qquad \text{(Equation 5)}$$

overlaps of length (O+1) blocks.

Approach 2

When the overlaps exist only within one sequence, as in FIG. 5, the average overlap {overscore (O)}, in blocks, is given by:

$$\overline{O} = \frac{O_{T}}{N - 1} \qquad \text{(Equation 6)}$$

which may be implemented as

$$p = (N - 1) \times (O + 1) - O_{T} \qquad \text{(Equation 7)}$$

overlaps of length O blocks and

$$q = (N - 1) - p \qquad \text{(Equation 8)}$$

overlaps of length (O+1) blocks.

Approach 3

When the overlaps exist within sequences, as in FIG. 6, the average overlap {overscore (O)}, in blocks, is given by:

$$\overline{O} = \frac{O_{T}}{N} \qquad \text{(Equation 9)}$$

which may be implemented as

$$p = N \times (O + 1) - O_{T} \qquad \text{(Equation 10)}$$

overlaps of length O blocks and

$$q = N - p \qquad \text{(Equation 11)}$$

overlaps of length (O+1) blocks.

The overlap length O may be expressed as

$$O = \left\lfloor \overline{O} \right\rfloor \qquad \text{(Equation 12)}$$

which, for the last approach, can be written as:

$$O = \left\lfloor k - \frac{\left( \frac{f_{A}}{f_{V}} \right)}{\left( \frac{N}{M} \right) \times t} \right\rfloor \qquad \text{(Equation 13)}$$

M is chosen to satisfy:

$$\exists n \in \aleph^{+} : n \times t = M \times \left( \frac{f_{A}}{f_{V}} \right) \qquad \text{(Equation 14)}$$

and the rate of audio frames per video frame N/M may be written as:

$$\frac{N}{M} = \left\lceil \frac{\left( \frac{f_{A}}{f_{V}} \right)}{k \times t} \right\rceil \qquad \text{(Equation 15)}$$

Cross-Fade

The reconstruction filter in an MPEG-1 decoder as defined in ISO/IEC 11172 "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s" Part 3: Audio (1993-08) is an overlapping filter bank. If splicing is done in the sub-band domain (i.e., on blocks), this results in a cross-fade of about 512 audio samples upon decoding.

Implementation of Embodiments Based on Common Coding Standards

Various encoding schemes have been considered as a basis for embodiments of the invention. In particular, MPEG-1 and MPEG-2, Layers I and II have been considered, but this is by no means an exclusive list of possible schemes. It must be said here that schemes embodying the invention use coding schemes similar to existing standards but, due to overlapping, they deviate from these standards.

As will be familiar to those skilled in the technical field, MPEG-2 is a standard scheme for encoding multichannel audio backward compatible with MPEG-1. On the other hand, a non-backward-compatible extension of the MPEG-1 standard to multichannel may offer implementation simplicity. Moreover, Layer II is more efficient than Layer I. On the other hand, Layer I offers less encoding redundancy due to its having a smaller number of blocks. A scheme based on MPEG-1 Layer I may offer the best combination of low redundancy and implementation simplicity in embodiments of the invention.

MPEG-2 Layer II

When using MPEG-2 Layer II as a basis for the encoding scheme, k=36 and t=32.
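Purely as an illustrative sketch, the way a total overlap O_(T) is shared out as p overlaps of O blocks and q overlaps of (O+1) blocks (Equations 2 to 12) might be written in C as below. The structure overlap_split, the function split_overlap and the parameter positions are hypothetical names introduced for this example; positions stands for the divisor of the average overlap, namely N−M, N−1 or N for Approaches 1, 2 and 3 respectively.

```c
#include <assert.h>

/* Result of sharing a total overlap among the available overlap positions. */
struct overlap_split {
    long o; /* short overlap length O, in blocks    */
    long p; /* number of overlaps of length O       */
    long q; /* number of overlaps of length (O + 1) */
};

/*
 * positions is N - M for Approach 1, N - 1 for Approach 2 and N for
 * Approach 3 (the divisor in Equations 3, 6 and 9).
 */
struct overlap_split split_overlap(long o_t, long positions)
{
    assert(o_t >= 0 && positions > 0);

    struct overlap_split s;
    s.o = o_t / positions;             /* Equation 12: floor of the average */
    s.p = positions * (s.o + 1) - o_t; /* Equations 4, 7 and 10 */
    s.q = positions - s.p;             /* Equations 5, 8 and 11 */

    /* Equation 2 must hold for the result. */
    assert(s.p * s.o + s.q * (s.o + 1) == o_t);
    return s;
}
```

Taking O_(T)=19 (f_(V)=24 Hz, f_(A)=48 kHz, M=2, N=4), positions of 2, 3 and 4 give (O, p, q) of (9, 1, 1), (6, 2, 1) and (4, 1, 3) respectively.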

Table 5 shows some examples of overlap sequences for various combinations of audio sample frequencies and video frame rates when the embodiment is based upon Approach 1, as described above.

TABLE 5: MPEG-2 Layer II and Approach 1

| f_(V) [Hz] | f_(A) [kHz] | M | N | O_(T) | {overscore (O)} | p × O + q × (O + 1) |
|---|---|---|---|---|---|---|
| 23.976 | 48 | 16 | 32 | 151 | 9.437 . . . | 9 × 9 + 7 × 10 |
| 23.976 | 44.1 | 2,560 | 5,120 | 37,173 | 14.520 . . . | 1,227 × 14 + 1,333 × 15 |
| 23.976 | 32 | 24 | 48 | 727 | 30.291 . . . | 17 × 30 + 7 × 31 |
| 24 | 48 | 2 | 4 | 19 | 9.5 | 1 × 9 + 1 × 10 |
| 24 | 44.1 | 64 | 128 | 933 | 14.578 . . . | 27 × 14 + 37 × 15 |
| 24 | 32 | 3 | 6 | 91 | 30.333 . . . | 2 × 30 + 1 × 31 |
| 25 | 48 | 1 | 2 | 12 | 12 | 1 × 12 + 0 × 13 |
| 25 | 44.1 | 8 | 16 | 135 | 16.875 | 1 × 16 + 7 × 17 |
| 25 | 32 | 1 | 2 | 32 | 32 | 1 × 32 + 0 × 33 |
| 29.97 | 48 | 20 | 40 | 439 | 21.95 | 1 × 21 + 19 × 22 |
| 29.97 | 44.1 | 3,200 | 6,400 | 83,253 | 26.016 . . . | 3,147 × 26 + 53 × 27 |
| 29.97 | 32 | n/a | n/a | n/a | n/a | n/a |

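Table 6 shows some examples of overlap sequences for diverse combinations of audio sample frequencies and video frame rates when the embodiment is based upon Approach 2, as described above.

TABLE 6: MPEG-2 Layer II and Approach 2

| f_(V) [Hz] | f_(A) [kHz] | M | N | O_(T) | {overscore (O)} | p × O + q × (O + 1) |
|---|---|---|---|---|---|---|
| 23.976 | 48 | 16 | 32 | 151 | 4.870 . . . | 4 × 4 + 27 × 5 |
| 23.976 | 48 | 32 | 64 | 302 | 4.793 . . . | 13 × 4 + 50 × 5 |
| 23.976 | 48 | 48 | 96 | 453 | 4.768 . . . | 22 × 4 + 73 × 5 |
| 23.976 | 44.1 | 2,560 | 5,120 | 37,173 | 7.261 . . . | 3,779 × 7 + 1,340 × 8 |
| 23.976 | 32 | 24 | 48 | 727 | 15.468 . . . | 25 × 15 + 22 × 16 |
| 23.976 | 32 | 48 | 96 | 1,454 | 15.305 . . . | 66 × 15 + 29 × 16 |
| 23.976 | 32 | 72 | 144 | 2,181 | 15.251 . . . | 107 × 15 + 36 × 16 |
| 24 | 48 | 2 | 4 | 19 | 6.333 . . . | 2 × 6 + 1 × 7 |
| 24 | 48 | 10 | 20 | 95 | 5 | 19 × 5 + 0 × 6 |
| 24 | 48 | 48 | 96 | 456 | 4.8 | 19 × 4 + 76 × 5 |
| 24 | 44.1 | 64 | 128 | 933 | 7.346 . . . | 83 × 7 + 44 × 8 |
| 24 | 44.1 | 128 | 256 | 1,866 | 7.317 . . . | 174 × 7 + 81 × 8 |
| 24 | 44.1 | 192 | 384 | 2,799 | 7.308 . . . | 265 × 7 + 118 × 8 |
| 24 | 32 | 3 | 6 | 91 | 18.2 | 4 × 18 + 1 × 19 |
| 24 | 32 | 6 | 12 | 182 | 16.545 . . . | 5 × 16 + 6 × 17 |
| 24 | 32 | 24 | 48 | 728 | 15.489 . . . | 24 × 15 + 23 × 16 |
| 25 | 48 | 1 | 2 | 12 | 12 | 1 × 12 + 0 × 13 |
| 25 | 48 | 2 | 4 | 24 | 8 | 3 × 8 + 0 × 9 |
| 25 | 48 | 7 | 14 | 84 | 6.461 . . . | 7 × 6 + 6 × 7 |
| 25 | 44.1 | 8 | 16 | 135 | 9 | 15 × 9 + 0 × 10 |
| 25 | 44.1 | 72 | 144 | 1,215 | 8.496 . . . | 72 × 8 + 71 × 9 |
| 25 | 32 | 1 | 2 | 32 | 32 | 1 × 32 + 0 × 33 |
| 25 | 32 | 2 | 4 | 64 | 21.333 . . . | 2 × 21 + 1 × 22 |
| 25 | 32 | 17 | 34 | 544 | 16.484 . . . | 17 × 16 + 16 × 17 |
| 29.97 | 48 | 20 | 40 | 439 | 11.256 . . . | 29 × 11 + 10 × 12 |
| 29.97 | 48 | 40 | 80 | 878 | 11.113 . . . | 70 × 11 + 9 × 12 |
| 29.97 | 48 | 220 | 440 | 4,829 | 11 | 439 × 11 + 0 × 12 |
| 29.97 | 44.1 | 3,200 | 6,400 | 83,253 | 13.010 . . . | 6,333 × 13 + 66 × 14 |
| 29.97 | 44.1 | 6,400 | 12,800 | 166,506 | 13.009 . . . | 12,680 × 13 + 119 × 14 |
| 29.97 | 32 | 30 | 30 | 79 | 2.724 . . . | 8 × 2 + 21 × 3 |
| 29.97 | 32 | 60 | 60 | 158 | 2.677 . . . | 19 × 2 + 40 × 3 |
| 29.97 | 32 | 90 | 90 | 237 | 2.662 . . . | 30 × 2 + 59 × 3 |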

Table 7 shows overlap sequences for various combinations of audio sample frequencies and video frame rates when the embodiment is based upon Approach 3, as described above.

TABLE 7: MPEG-2 Layer II and Approach 3

| f_(V) [Hz] | f_(A) [kHz] | M | N | O_(T) | {overscore (O)} | p × O + q × (O + 1) |
|---|---|---|---|---|---|---|
| 23.976 | 48 | 16 | 32 | 151 | 4.718 . . . | 9 × 4 + 23 × 5 |
| 23.976 | 44.1 | 2,560 | 5,120 | 37,173 | 7.260 . . . | 3,787 × 7 + 1,333 × 8 |
| 23.976 | 32 | 24 | 48 | 727 | 15.145 . . . | 41 × 15 + 7 × 16 |
| 24 | 48 | 2 | 4 | 19 | 4.75 | 1 × 4 + 3 × 5 |
| 24 | 44.1 | 64 | 128 | 933 | 7.289 . . . | 91 × 7 + 37 × 8 |
| 24 | 32 | 3 | 6 | 91 | 15.166 . . . | 5 × 15 + 1 × 16 |
| 25 | 48 | 1 | 2 | 12 | 6 | 2 × 6 + 0 × 7 |
| 25 | 44.1 | 8 | 16 | 135 | 8.437 . . . | 9 × 8 + 7 × 9 |
| 25 | 32 | 1 | 2 | 32 | 16 | 2 × 16 + 0 × 17 |
| 29.97 | 48 | 20 | 40 | 439 | 10.975 | 1 × 10 + 39 × 11 |
| 29.97 | 44.1 | 3,200 | 6,400 | 83,253 | 13.008 . . . | 6,347 × 13 + 53 × 14 |
| 29.97 | 32 | 30 | 30 | 79 | 2.633 . . . | 11 × 2 + 19 × 3 |

MPEG-2 Layer I

When using MPEG-2 Layer I as the encoding scheme, k=12 and t=32. By using Approach 3, we obtain the sequences shown in Table 8.

TABLE 8: MPEG-2 Layer I and Approach 3

| f_(V) [Hz] | f_(A) [kHz] | M | N | O_(T) | {overscore (O)} | p × O + q × (O + 1) |
|---|---|---|---|---|---|---|
| 23.976 | 48 | 16 | 96 | 151 | 1.572 . . . | 41 × 1 + 55 × 2 |
| 23.976 | 44.1 | 2,560 | 12,800 | 6,453 | 0.504 . . . | 6,347 × 0 + 6,453 × 1 |
| 23.976 | 32 | 24 | 96 | 151 | 1.572 . . . | 41 × 1 + 55 × 2 |
| 24 | 48 | 2 | 12 | 19 | 1.583 . . . | 5 × 1 + 7 × 2 |
| 24 | 44.1 | 64 | 384 | 933 | 2.429 . . . | 219 × 2 + 165 × 3 |
| 24 | 32 | 3 | 12 | 19 | 1.583 . . . | 5 × 1 + 7 × 2 |
| 25 | 48 | 1 | 5 | 0 | 0 | 5 × 0 + 0 × 1 |
| 25 | 44.1 | 8 | 40 | 39 | 0.975 | 1 × 0 + 39 × 1 |
| 25 | 32 | 1 | 4 | 8 | 2 | 4 × 2 + 0 × 3 |
| 29.97 | 48 | 20 | 100 | 199 | 1.99 | 1 × 1 + 99 × 2 |
| 29.97 | 44.1 | 3,200 | 12,800 | 6,453 | 0.504 . . . | 6,347 × 0 + 6,453 × 1 |
| 29.97 | 32 | 30 | 90 | 79 | 0.877 . . . | 11 × 0 + 79 × 1 |

It should be noted that the average redundancy is much less than is the case when using Layer II.

MPEG-1

Another simplification that could be applied to embodiments is the use of MPEG-1 as the basis for the encoding scheme. In this case, the upper limit of two channels (e.g., stereo) of MPEG-1 can be extended to n channels. Therefore, each channel can have a bit allocation dependent on the total bit availability and on audio content per channel.

Algorithms

In the following section, algorithms applicable to calculating overlaps according to Approach 3 will be described.

Encoding

An encoder for creating an embodiment stream creates a sequence of frames of a predetermined structure. Each frame j has the structure shown in Table 9 below, where k is the total number of blocks, H(j) is the number of blocks in the head overlap and T(j) is the number of blocks in the tail overlap.

TABLE 9

| H(j) | k − [H(j) + T(j)] | T(j) |

Note that T(j) = H(j + 1).

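Knowing the values of N, O and q, the encoder may calculate the exact head overlap of each new frame using the following algorithm, where counter is initialised to zero before the first frame:

    /* Returns the head overlap, in blocks, of the next audio frame. */
    int head_overlap(int N, int O, int q, int *counter)
    {
        int overlap;

        /* q > 0 guards the case of a sequence with no long overlaps. */
        if (q > 0 && (*counter >= N || *counter == 0)) {
            overlap = O + 1;          /* long overlap */
            *counter = *counter % N;
        } else {
            overlap = O;              /* short overlap */
        }
        *counter = *counter + q;      /* prepare for the next frame */
        return overlap;
    }

In the case of MPEG-2 Layer II, f_(V)=24 Hz and f_(A)=48 kHz, we have from Table 7 that N=4, O=4 and q=3. That generates the following sequence of head overlaps: 5, 4, 5 and 5, or any circular shift thereof.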
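The per-frame rule above can also be run over a whole sequence of N audio frames. The following C fragment is a minimal sketch under stated assumptions rather than the encoder itself: the function name head_overlap_sequence and its output array are hypothetical, and the test q > 0 is an assumption added so that a sequence with no long overlaps (q=0) yields only short overlaps.

```c
#include <assert.h>

/*
 * Fills h[0..n-1] with the head overlaps of one sequence of n audio
 * frames: q of them are (o + 1) blocks long, the remainder o blocks.
 */
void head_overlap_sequence(int n, int o, int q, int *h)
{
    assert(n > 0 && o >= 0 && q >= 0 && q <= n && h != 0);

    int counter = 0;
    for (int j = 0; j < n; j++) {
        /* Assumption: q == 0 is treated as "no long overlaps". */
        if (q > 0 && (counter >= n || counter == 0)) {
            h[j] = o + 1;   /* long overlap */
            counter %= n;
        } else {
            h[j] = o;       /* short overlap */
        }
        counter += q;       /* accumulate for the next frame */
    }
}
```

Called with n=4, o=4 and q=3, the array is filled with 5, 4, 5, 5, whose sum of 19 blocks equals the total overlap O_(T) listed for this case in Table 7.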

Every audio frame must be tagged to indicate its size. In the above-described scheme, the head overlap may be only O or O+1 long. Therefore, it is possible to use a 1-bit tag to differentiate short and long frames.

The useful size F(j) of the frame j within a video sequence is given by:

$$F(j) = k - H(j + 1) \qquad \text{(Equation 16)}$$
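As an illustration of Equation 16, the useful sizes of all frames in a sequence may be derived from the head overlaps as sketched below. The function name useful_sizes is hypothetical, and the wrap-around of the index j+1 at the end of the array is an assumption based on the sequence repeating periodically.

```c
#include <assert.h>

/*
 * Computes F(j) = k - H(j + 1) for each of the n frames of a sequence.
 * h[] holds the n head overlaps; the index wraps around on the
 * assumption that the sequence of overlaps repeats with period n.
 */
void useful_sizes(int k, int n, const int *h, int *f)
{
    assert(k > 0 && n > 0 && h != 0 && f != 0);

    for (int j = 0; j < n; j++) {
        int next_head = h[(j + 1) % n]; /* T(j) = H(j + 1) */
        assert(next_head >= 0 && next_head <= k);
        f[j] = k - next_head;
    }
}
```

With k=36 and head overlaps 5, 4, 5, 5, the useful sizes are 32, 31, 31 and 31 blocks. Their sum, 125 blocks of 32 samples, is the 4,000 samples spanned by two video frames at 24 Hz and 48 kHz.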

Every block must be tagged to indicate its redundancy. In the above-described scheme, the block may be only redundant or not redundant. Therefore, it is possible to use a 1-bit tag to differentiate redundant and non-redundant blocks.

Recording and Transmission

Although redundant information must be encoded, it need not all be transmitted. This saves bitrate in the transmitted stream. The minimum total number of blocks B_(min) to be recorded or transmitted within a video sequence is given by:

$$B_{\min} = \left( k - \left\lceil \frac{O_{T}}{N} \right\rceil \right) \times N + p \qquad \text{(Equation 17)}$$

An extra redundant block per audio frame may be needed to allow for editing the encoded stream. In this case, the maximum total number of blocks B_(MAX) to be recorded or transmitted within a video sequence is given by:

$$B_{MAX} = \left( k - \left\lfloor \frac{O_{T}}{N} \right\rfloor \right) \times N + p \qquad \text{(Equation 18)}$$
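Equations 17 and 18 may be evaluated directly, as in the following sketch. The function names min_blocks, max_blocks and ceil_div are hypothetical, and the code simply transcribes the two equations as written, with no further claim about their range of validity.

```c
#include <assert.h>

/* Ceiling of a / b for a >= 0 and b > 0. */
static long ceil_div(long a, long b)
{
    return (a + b - 1) / b;
}

/* Equation 17: B_min = (k - ceil(O_T / N)) * N + p. */
long min_blocks(long k, long n, long o_t, long p)
{
    assert(k > 0 && n > 0 && o_t >= 0 && p >= 0);
    return (k - ceil_div(o_t, n)) * n + p;
}

/* Equation 18: B_MAX = (k - floor(O_T / N)) * N + p. */
long max_blocks(long k, long n, long o_t, long p)
{
    assert(k > 0 && n > 0 && o_t >= 0 && p >= 0);
    return (k - o_t / n) * n + p;
}
```

For MPEG-2 Layer II at f_(V)=24 Hz and f_(A)=48 kHz (k=36, N=4, O_(T)=19, p=1), this gives B_(min)=125 and B_(MAX)=129, the latter being one extra block per audio frame.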

A phase φ may be defined to indicate the relative start, in blocks, ofthe encoded stream with respect to the first video frame in the videosequence. A suitable choice for φ is: $\begin{matrix}{\varphi = \left\lceil \frac{O}{2} \right\rceil} & {{Equation}\quad 19}\end{matrix}$

Moreover, the encoder generates null padding Q to complete the stream in accordance with the IEC61937 standard. The length of the padding depends not only on the payload length; it must also take video boundaries into consideration to avoid a cumulative error being introduced into the encoded stream.

Editing

Editing of the stream encoded in accordance with the embodiment may be performed at video frame boundaries by adding, removing or appending frames. The decoder corrects the errors that may be generated by editing using information available within the decoder (such as values of f_(A) and f_(V)) or information generated by the encoder (such as size tags). No additional information need be recorded or transmitted as a result of editing. Moreover, cross-fade at the editing point may be provided by a reconstruction filter bank within the decoder.

Decoding

A decoder for decoding a stream calculates the expected useful size F(j) for the current frame j. Moreover, it reads a size tag from the incoming frame to determine the actual useful size G(j).

Blocks within an audio frame may have one of two statuses: redundant or non-redundant. Non-redundant blocks are recorded, transmitted and decoded into sub-band samples. Redundant blocks (such as the first redundant block in the tail overlap) may be recorded and transmitted in order to ease the decoding process. However, redundant blocks are never decoded into sub-band samples.

For modifying the status of an overlap block, four operators are defined: NOP, DROP, APPEND and SHIFT.

-   NOP: The NOP operator does not change the status of blocks.
-   DROP: The DROP operator changes the first non-redundant block from the head overlap into a redundant block.
-   APPEND: The APPEND operator changes the first redundant block from the tail overlap into a non-redundant block.
-   SHIFT: The SHIFT operator is a combination of both DROP and APPEND operators.

The decoding of frames in a stream embodying the invention into sub-band samples is referred to as mapping. Only non-redundant blocks are mapped into sub-band samples. If the incoming frame is larger than expected, the operator DROP is applied. Conversely, if the incoming frame is smaller than expected, the operator APPEND is applied. When the actual size equals the expected size, the decoder looks to the previous frame. If the previous frame has been appended or shifted, the operator SHIFT is applied; otherwise, the incoming frame is mapped without modification.
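The operator selection rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the decoder of the embodiment: the names `Op`, `select_operator` and `plan_operators` are hypothetical, the frame before the first one is assumed to have been handled with NOP, and the actual sizes in the example are invented for illustration.

```python
from enum import Enum


class Op(Enum):
    NOP = "NOP"
    DROP = "DROP"
    APPEND = "APPEND"
    SHIFT = "SHIFT"


def select_operator(expected, actual, previous):
    """Choose the block operator for one incoming frame.

    expected: expected useful size F(j), in blocks
    actual:   actual useful size G(j) read from the size tag, in blocks
    previous: operator that was applied to the previous frame
    """
    if actual > expected:
        return Op.DROP
    if actual < expected:
        return Op.APPEND
    if previous in (Op.APPEND, Op.SHIFT):
        return Op.SHIFT
    return Op.NOP


def plan_operators(expected_sizes, actual_sizes):
    """Apply select_operator frame by frame.

    Assumption: the frame preceding the first one was mapped with NOP.
    """
    ops = []
    previous = Op.NOP
    for expected, actual in zip(expected_sizes, actual_sizes):
        previous = select_operator(expected, actual, previous)
        ops.append(previous)
    return ops


# Expected sizes from Table 10; actual sizes are hypothetical.
unedited = plan_operators([32, 31, 31, 31], [32, 31, 31, 31])
short_first = plan_operators([32, 31, 31, 31], [31, 31, 31, 31])
long_second = plan_operators([32, 31, 31, 31], [32, 32, 31, 31])
```

In the second call the first frame is one block short, so it is appended and every following frame of equal size is shifted, as the rule prescribes.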

Synchronization Error

A stream embodying the invention is based upon the creation of a mean effective audio frame length $\overline{F}$ that equals the video frame length 1/f_(V) by alternation of long (i.e., tagged) and short frames in a defined sequence F(j) at encoding. The redundancy needed for reproducing the previously defined sequence F(j) of long and short frames at decoding, despite the actual length G(j) of the incoming frames after editing, is obtained by overlapping frames at editing points. At editing, the synchronisation error ε(j), in blocks, due to swapping frames may be expressed by

$$\varepsilon(j) = \left(j \times \frac{p}{N}\right) - \left\lfloor j \times \frac{p}{N} \right\rfloor. \qquad \text{(Equation 20)}$$

At any time one may write

$$j \times p = u + N \times v, \qquad \text{(Equation 21)}$$

with u∈{0,1,2, . . . ,N−1} and v∈{0,1,2, . . . ,p}. By substitution, it follows

$$\varepsilon(j) = \frac{u}{N}, \qquad \text{(Equation 22)}$$

whence 0≤ε_(MAX)<1−1/N. Upon decoding, those redundancies are discarded appropriately by using operators NOP, DROP, APPEND and SHIFT as described above. Moreover, the incoming frame G(j) may be delayed by one block due to a DROP or SHIFT operation. Therefore, it can be shown that the total synchronisation error δ introduced by the process is bounded as follows:

$$\Delta t = 0 \Rightarrow \delta \in \left[0,\, 1 - \frac{1}{N}\right) \;\wedge\; \Delta t = -1 \Rightarrow \delta \in \left[-1,\, -\frac{1}{N}\right) \qquad \text{(Equation 23)}$$

with limits:

$$-1 \le \delta_{MAX} < 1 \qquad \text{(Equation 24)}$$

Cascading

Several cascading levels of lossy encoding and decoding may degrade the signal. However, low compression rates at contribution and distribution, metadata relating to the compressed signals, and special techniques can be employed to keep this degradation imperceptible to the end-user. Methods applicable to MPEG encoding are known to those working in the technical field (for example, as described in “Maintaining Audio Quality in Cascaded Psychoacoustic Coding”, Warner R. Th. ten Kate, 101st AES Convention, 1996 Nov. 8-11), which may be used with embodiments of the invention to maintain the quality of the audio signal throughout the DTV broadcasting chain.

EXAMPLES OF THE INVENTION

Block Arrangement

The audio frame sequence, encoded in accordance with an embodiment of this invention, for film and professional audio based on MPEG-2 Layer II and approach 3 overlaps is shown in Table 10. All possible arrangements of blocks after decoding the stream, according to another embodiment of this invention, are shown in FIG. 8. The parameters are as follows (referring to the list of symbols, above):

-   video frame rate f_(V)=24 Hz, video frame length t_(V)=41.67 ms;
-   audio sampling frequency f_(A)=48 kHz, audio frame length t_(A)=24 ms;
-   k=36 blocks, t=32 samples;
-   M=2 video frames, N=4 audio frames;
-   overlap: O_(T)=19 blocks, $\overline{O}$=4.75 blocks, O=4 blocks, O+1=5 blocks;
-   p=1 short overlap, q=3 long overlaps;
-   b=31 blocks, b+1=32 blocks;
-   B_(min)=125, B_(MAX)=129, φ=2 blocks;

ε_(MAX)=0.75 block,

$$\delta \in \begin{cases} [0,\, 0.75) & \Leftarrow \Delta t = 0 \\ [-1,\, -0.25) & \Leftarrow \Delta t = -1 \end{cases}$$

TABLE 10

j      1   2   3   4
H(j)   5   4   5   5
F(j)  32  31  31  31

Application of the System to the IEC61937 Standard

A suitable standard for transmitting the stream embodying the invention is the IEC61937 standard (‘Interface for non-linear PCM encoded audio bitstreams applying IEC 60958’). In the stream allocation shown in FIG. 7 for the previous example:

-   The IEC61937 frame has a length (16/32)×3.072 Mbit/s/f_(V). For f_(V)=24 Hz, it corresponds to 64,000 bits.
-   The preambles: Pa=F872h, syncword 1; Pb=4E1Fh, syncword 2; Pc=burst information; Pd=number of bits<65 536, length code.
-   Repetition period of data-burst is a number of IEC60958 frames.
-   Relative timing accuracy between audio and video after editing a VTR tape and delays introduced by switcher systems determine the minimum gap needed between two frames. This so-called splicing gap may be obtained by means of null-frame stuffing.

This can be summarised as:

-   Stuffing=splicing gap+burst spacing; splicing gap=tape+switch inaccuracy; burst spacing=4×IEC60958 “0” sub-frames, each 4096×IEC60958 frames.
-   Burst-payload: System frame=(N/M)×[System sub-frame−head overlap]; N=4; M=2; N/M=2.

If the stream embodying the invention is based on MPEG-2 Layer II for 5.1 channels at 384 kbit/s, the system requires at most 45,504 bits (2×[(1,152−4×32)×384/48+(2,047−4×32/1,152×2,047)×8]+0).

Instead, if the stream embodying the invention is based on a 6-channel version of MPEG-1 Layer II at 192 kbit/s per channel, it would require at most 49,152 bits (2×(1,152−4×32)×6×192/48+0). If we take into account that the LFE channel requires only 12 samples per frame, the effective bitrate would be approximately 230 kbit/s per channel.
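The bit counts quoted above can be reproduced with a short calculation. This sketch is illustrative: all variable names are hypothetical, the expressions are copied from the text, and rounding the second bracketed term of the 5.1-channel expression up to a whole number before multiplying by 8 is an assumption made here because it reproduces the quoted 45,504 bits.

```python
import math
from fractions import Fraction

# IEC61937 frame length in bits: (16/32) x 3.072 Mbit/s / f_V, with f_V = 24 Hz
frame_bits = Fraction(16, 32) * 3_072_000 / 24

frames_per_video_frame = 2        # N/M with N = 4, M = 2
samples = 1152 - 4 * 32           # sub-frame less a head overlap of 4 blocks

# 6-channel MPEG-1 Layer II at 192 kbit/s per channel, f_A = 48 kHz
six_channel_bits = frames_per_video_frame * samples * 6 * 192 // 48

# MPEG-2 Layer II, 5.1 channels at 384 kbit/s
main_bits = samples * 384 // 48
# Assumption: the second term is rounded up before the factor of 8 is applied.
second_term = math.ceil(2047 - Fraction(4 * 32, 1152) * 2047) * 8
five_one_bits = frames_per_video_frame * (main_bits + second_term)

# Both payloads should fit within one IEC61937 frame.
fits = five_one_bits <= frame_bits and six_channel_bits <= frame_bits
```

`Fraction` keeps the intermediate values exact, so the only rounding is the one stated in the comment.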

1. An audio encoding scheme for a stream that carries audio and video data, which scheme has a mean effective audio frame length $\overline{F}$ that equals the video frame length 1/f_(V) over an integral number M video frames, by provision of audio frames variable in length F in a defined sequence F(j) at encoding.
2. An encoding scheme according to claim 1 in which the frame length F is adjusted by varying an overlap O between successive audio frames.
3. An encoding scheme according to claim 1 or claim 2 in which the value F(j) repeats periodically on j, the periodicity of F(j) defining a sequence of frames.
4. An encoding scheme according to claim 3 having M video and N audio frames per sequence, each audio frame being composed of k blocks of t samples each.
5. An encoding scheme according to claim 4 in which a total overlap O_(T) between frames in the sequence is equal to O_(T)=p×O+q×(O+1), where O is an overlap length in blocks where p∈ℕ ∧ q∈ℕ ∧ O∈ℕ ∧ O_(T)∈ℕ.
6. An encoding scheme according to claim 5 in which only audio frames corresponding to a particular video frame are overlapped.
7. An encoding scheme according to claim 6 in which p=(N−M)×(O+1)−O_(T) and q=(N−M)−p.
8. An encoding scheme according to claim 5 in which only audio frames corresponding to a particular video sequence are overlapped.
9. An encoding scheme according to claim 8 in which p=(N−1)×(O+1)−O_(T) and q=(N−1)−p.
10. An encoding scheme according to claim 5 in which any adjacent audio frames are overlapped.
11. An encoding scheme according to claim 10 in which p=N×(O+1)−O_(T) and q=N−p.
12. An encoding scheme according to any one of claims 4 to 11 in which ∃n∈ℕ⁺: $n \times t = M \times \left(\frac{f_{A}}{f_{V}}\right)$.
13. An audio encoding scheme for a stream that encodes audio and video data in which scheme audio samples of N quasi video-matched frames are encoded in frames with a semi-variable overlap whereby the effective length of the audio frames coincides with the length of a sequence of M video frames, where M and N are positive integers.
14. A data stream encoded by a scheme according to any preceding claim.
15. A data stream according to claim 14 which includes audio frames, each of which is tagged to indicate the size of the audio frame.
16. A data stream according to claim 14 or claim 15 which includes audio frames, each block of which is tagged to indicate whether or not the block is a redundant block.
17. An audio encoder for coding audio for a stream that carries audio and video data in which the encoder produces audio frames of variable length such that a mean effective audio frame length $\overline{F}$ equals the video frame length 1/f_(V) over an integral number M video and N audio frames, by provision of audio frames of variable overlap to have an effective length F in a defined sequence F(j) at encoding.
18. An audio encoder according to claim 17 for coding a stream having a short overlap of length O and a total of q long overlaps in a sequence, the encoder calculating the head overlap using an algorithm that repeats after N frames.
19. An audio decoder for decoding a stream that encodes audio and video data, which decoder calculates an expected effective frame length of an incoming frame, adjusts the actual length of the incoming frame to make it equal to the expected frame length, determines whether any block within a received frame is a redundant block or a non-redundant block, mapping the non-redundant blocks onto sub-band samples.
20. An audio decoder according to claim 19 which modifies the overlap status of blocks in the data stream by application of one or more of a set of block operators to each block.
21. An audio decoder according to claim 20 in which the set of operators includes one or more of: NOP, an operator that does not change the status of a block; DROP, an operator that changes the first non-redundant block from the head overlap into a redundant block; APPEND, an operator that changes the first redundant block from the tail overlap into a non-redundant block; and SHIFT, an operator that is a combination of both DROP and APPEND operators.